Abstract



We reformulate a class of non-linear stochastic optimal control problems introduced by [1] as a KL
minimization problem. As a result, the optimal control computation reduces to an inference computation 
and approximate inference methods can be applied to efficiently compute approximate optimal controls. We 
show that the path integral control problem introduced in [5] can be obtained as a special case of the KL 
control problem. We provide an example of a block stacking task where we demonstrate how approximate 
inference can be successfully applied to instances that are too large for exact computation. 

1 Introduction 

Stochastic optimal control theory deals with the problem of computing an optimal set of actions to attain some
future goal. A cost is associated with each action and each state, and the aim is to minimize the total cost.
Examples are found in many contexts such as motor control tasks for robotics, planning and scheduling tasks 
or managing a financial portfolio. The computation of the optimal control is typically very difficult due to the 
size of the state space and the stochastic nature of the problem. 

The most common approach to compute the optimal control is through the Bellman equation. For the finite 
horizon discrete time case, this equation results from a dynamic programming argument that expresses the 
optimal cost-to-go (or value function) at time t in terms of the optimal cost-to-go at time t + 1. For the infinite
horizon case, the value function is independent of time and the Bellman equation becomes a recursive equation. 
In continuous time, the Bellman equation becomes a partial differential equation. 

The computation of the optimal control requires the computation of the optimal value function, which assigns
a value to each state of the system. For high dimensional systems or for continuous systems the state space
is huge and the above procedure cannot be directly applied. A common approach to make the computation
tractable is function approximation, where the value function is parametrized in terms of a number
of fixed basis functions, which reduces the Bellman computation to the estimation of these parameters only
[3]. Another promising approach is to exploit graphical structure that is present in the problem to make the
computation more efficient. However, this graphical structure is not inherited by the value function, and
thus the graphical representation of the value function is an approximation [5].

In this paper, we introduce a class of stochastic optimal control problems where the optimal control is 
expressed as a probability distribution p over future trajectories and where the control cost can be written as 
a KL divergence between p and some interaction terms. The optimal control is given by minimizing the KL 
divergence, which is equivalent to solving a probabilistic inference problem in a dynamic Bayesian network. 
Instead of solving the control problem with the Bellman equation, the optimal control is given in terms of 
(marginals of) a probability distribution over future trajectories. Thus, the formulation of the control problem
as an inference problem directly suggests a number of well-known approximation methods, such as the variational
method [6], belief propagation [7], CVM or GBP [5] or MCMC sampling methods. We refer to this class of
problems as KL control problems.

The class of control problems considered in this paper is identical to that in [1, 9], where it is shown that the Bellman
equation can be written as a KL divergence of probability distributions between two adjacent time slices and 
that the Bellman equation computes backwards messages in a chain as if it were an inference problem. The 
novel contribution of the present paper is to identify the total expected control cost up to the horizon time with 
a KL divergence instead of making this identification in the Bellman equation. The immediate consequence is 
that the optimal control problem is a graphical model inference problem (for this class of control problems) and 
that the resulting graphical model inference problem can be approximated using standard methods. 






The equivalence of certain types of control problems to inference problems is well-known and goes back to
Kalman for the linear quadratic Gaussian case [10]. It relies on the exponential relation between the value
function in the Bellman equation and posterior marginal probabilities in the dynamic Bayesian network (DBN) and was
previously exploited in [2, 11] for the non-linear continuous space and time Gaussian case and in [1] for the discrete case.

The paper is organized as follows. In section 2 we review the general discrete time and discrete state control
problem and derive the Bellman equation. Subsequently, we introduce the KL control problem and show that
it is equivalent to the formulation of [1]. In section 4 we show how the class of continuous space control problems previously
considered in [2] can be obtained as a special case of the formulation of section 2. The main difference is that
it has discrete time instead of continuous time. In section 3 we consider the special case where the state $x$
consists of components $x = x_1, \ldots, x_n$ that each act according to their own local dynamics and the cost has a
(sparse) graphical model structure. For this case the control computation becomes a graphical model inference
problem. We discuss some of the approximate inference methods that can be applied. In section 5 we apply
the formulation to the task of stacking blocks onto a single pile.



2 Control as KL minimization 

In this section, we introduce the class of control problems that can be shown to be equivalent to a KL
minimization. We restrict ourselves to the discrete state and time formulation and the finite horizon. The infinite
horizon case is obtained as the limit of the horizon time going to infinity, when the dynamics and costs do not
explicitly depend on time and there exists a subset of absorbing states.

Let $x = 1, \ldots, N$ be a finite set of states, where $x^t$ denotes the state at time $t$. Denote by $p^t(x^{t+1}|x^t,u^t)$ the
Markov transition probability at time $t$ under control $u^t$ from state $x^t$ to state $x^{t+1}$. Let $p(x^{1:T}|x^0,u^{0:T-1})$
denote the probability to observe the trajectory $x^{1:T}$ given initial state $x^0$ and control trajectory $u^{0:T-1}$.

If the system at time $t$ is in state $x$ and takes action $u$ to state $x'$, there is an associated cost $R(x,u,x',t)$.
The total cost consists of a sum of terms, one for each time $t$. The control problem is to find the sequence
$u^{0:T-1}$ that minimizes the expected future cost

$$C(x^0,u^{0:T-1}) = \sum_{x^{1:T}} p(x^{1:T}|x^0,u^{0:T-1}) \sum_{t=0}^{T} R(x^t,u^t,x^{t+1},t) = \left\langle \sum_{t=0}^{T} R(x^t,u^t,x^{t+1},t) \right\rangle \qquad (1)$$

with the convention that $R(x^T,u^T,x^{T+1},T) = R(x^T,T)$ is the cost of the final state. Note that $C$ depends on $u$
in two ways: through $R$ and through the probability distribution of the controlled trajectories $p(x^{1:T}|x^0,u^{0:T-1})$.

The optimal control is normally computed using the Bellman equation, which results from a dynamic
programming argument. For this purpose, one considers an intermediate time $t$ and defines the optimal cost-to-go

$$J(x^t,t) = \min_{u^{t:T-1}} C^t(x^t,u^{t:T-1}) \qquad (2)$$

$$C^t(x^t,u^{t:T-1}) = \sum_{x^{t+1:T}} p(x^{t+1:T}|x^t,u^{t:T-1}) \sum_{s=t}^{T} R(x^s,u^s,x^{s+1},s) \qquad (3)$$

with $C^0(x^0,u^{0:T-1}) = C(x^0,u^{0:T-1})$, as the solution of the control problem from $t$ to $T$ for any state $x$. $J$ is
also known as the value function. One solves $J$ recursively, by noting that $J(x,T) = R(x,T)$ for all $x$ and

$$J(x^t,t) = \min_{u^{t:T-1}} \left\langle \sum_{s=t}^{T} R(x^s,u^s,x^{s+1},s) \right\rangle = \min_{u^t} \sum_{x^{t+1}} p^t(x^{t+1}|x^t,u^t) \left( R(x^t,u^t,x^{t+1},t) + J(x^{t+1},t+1) \right) \qquad (4)$$

Eq. 4 is called the Bellman equation. Eq. 4 depends on the state $x^t$ and on $t$, and therefore the optimal value
$u^t$ that is computed by minimizing Eq. 4 depends on $x^t$ and $t$. The optimal sequence $u^{0:T-1}$ is computed iteratively
backwards from $t = T-1$ to $t = 0$. Finally, the optimal control at time $t = 0$ is given by $u^0(x^0)$.
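The backward recursion of Eq. 4 can be sketched in a few lines for a small discrete problem. This is only an illustrative sketch: the function name `bellman_backward`, the array layout `P[t][u, x, y]` and `R[t][u, x, y]`, and the assumption of a finite set of controls indexed by integers are choices made here, not part of the paper.

```python
import numpy as np

def bellman_backward(P, R, R_final):
    """Finite-horizon dynamic programming for Eq. 4 (illustrative layout).

    P[t][u, x, y] : transition probability p^t(y | x, u)
    R[t][u, x, y] : cost R(x, u, y, t)
    R_final[x]    : end cost R(x, T)
    Returns J with shape (T+1, N) and the minimizing controls with shape (T, N).
    """
    T = len(P)
    N = R_final.shape[0]
    J = np.zeros((T + 1, N))
    policy = np.zeros((T, N), dtype=int)
    J[T] = R_final                      # J(x, T) = R(x, T)
    for t in range(T - 1, -1, -1):      # backwards from T-1 to 0
        # Q[u, x] = sum_y p(y|x,u) * (R(x,u,y,t) + J(y,t+1))
        Q = np.einsum("uxy,uxy->ux", P[t], R[t] + J[t + 1][None, None, :])
        J[t] = Q.min(axis=0)
        policy[t] = Q.argmin(axis=0)
    return J, policy
```

The table `J` has one entry per state and time, which is exactly what becomes infeasible when the state space is large.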

We will now consider the restricted class of control problems for which the total cost $R$ is given as the sum of
a control dependent term and a state dependent term. We further assume the existence of a 'free' (uncontrolled)
dynamics $q^t(x^{t+1}|x^t)$, which can be any stochastic dynamics. We quantify the control cost as the amount of
deviation between $p^t(x^{t+1}|x^t,u^t)$ and $q^t(x^{t+1}|x^t)$ in relative entropy sense. Thus,

$$R(x^t,u^t,x^{t+1},t) = \log \frac{p^t(x^{t+1}|x^t,u^t)}{q^t(x^{t+1}|x^t)} + R(x^t,t), \qquad t = 0, \ldots, T-1 \qquad (5)$$

[Figure 1 diagram. Panel labels: "Dynamics: $p^t(x^t|x^{t-1},u^{t-1})$; Cost: $C(u^{0:T}) = \langle R \rangle$"; "dynamic programming -> Bellman Equation"; "restricted class of problems"; "Free dynamics: $q^t(x^t|x^{t-1})$; $C = KL(p\|q\exp(-R))$"; "approximate inference"; "approximate $J$"; "Optimal $u$".]

Figure 1: Overview of the approaches to computing the optimal control. Top left) The general optimal control problem 
is formulated as a state transition model p that depends on the control (or policy) u and a cost C(u) that consists of an 
expectation value with respect to the controlled dynamics of a state and control dependent cost R over a future horizon. 
The optimal control is given by the u that minimizes a cost C(u). Top right) The traditional approach to solve the 
control problem is to introduce the notion of cost-to-go or value function J, which satisfies the Bellman equation. The 
Bellman equation is derived using a dynamic programming argument. Bottom right) Often the state space is too large to 
be able to solve the Bellman equation. Typically, an approximate representation J is used to solve the Bellman equation 
which yields the optimal control. Bottom left) The approach in this paper is to consider a class of control problems for 
which the minimization of C with respect to u is equivalent to the minimization of a KL divergence with respect to p. 
The computation of the optimal control becomes a graphical model inference problem. 



with $R(x,t)$ an arbitrary state dependent control cost and $C$ becomes

$$C(x^0,p) = KL(p\|q) + \langle R \rangle = \sum_{x^{1:T}} p(x^{1:T}|x^0) \log \frac{p(x^{1:T}|x^0)}{\psi(x^{1:T}|x^0)} \qquad (6)$$

$$\psi(x^{1:T}|x^0) = q(x^{1:T}|x^0) \exp\left( -\sum_{t=0}^{T} R(x^t,t) \right) \qquad (7)$$



Instead of assuming a parametric form for $p(x^t|x^{t-1},u^{t-1})$ as a function of $u$ and minimizing $C$ with respect
to $u$, we minimize $C$ directly with respect to $p$, subject to the normalization constraint $\sum_{x^{1:T}} p(x^{1:T}|x^0) = 1$.
The result is

$$p(x^{1:T}|x^0) = \frac{1}{Z(x^0)} \psi(x^{1:T}|x^0) \qquad (8)$$

and the optimal cost

$$C(x^0,p) = -\log Z(x^0) = -\log \sum_{x^{1:T}} q(x^{1:T}|x^0) \exp\left( -\sum_{t=0}^{T} R(x^t,t) \right) \qquad (9)$$

where $Z(x^0)$ is a normalization constant. In other words, the optimal control solution is the (normalized)
product of the free dynamics and the exponentiated costs. It is a distribution that avoids states of high $R$, at
the same time deviating from $q$ as little as possible.
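For a chain small enough to enumerate, the statements of Eqs. 8 and 9 can be checked numerically. The sketch below is an illustration under assumptions made here: the free dynamics `q[x, y]` is taken to be time-independent, the state cost is stored as `R[t, x]`, and the helper names `enumerate_psi` and `kl_control_cost` are hypothetical.

```python
import itertools
import numpy as np

def enumerate_psi(q, R, x0, T):
    """psi(x^{1:T} | x^0) of Eq. 7 for every trajectory, by brute force."""
    N = q.shape[0]
    paths = list(itertools.product(range(N), repeat=T))
    psi = np.empty(len(paths))
    for idx, path in enumerate(paths):
        prob, cost, prev = 1.0, R[0, x0], x0
        for t, x in enumerate(path, start=1):
            prob *= q[prev, x]      # free dynamics q(x^t | x^{t-1})
            cost += R[t, x]         # state cost R(x^t, t)
            prev = x
        psi[idx] = prob * np.exp(-cost)
    return paths, psi

def kl_control_cost(p, psi):
    """C(x^0, p) = sum over paths of p log(p / psi), Eq. 6."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / psi[mask])))

rng = np.random.default_rng(0)
N, T, x0 = 3, 3, 0
q = rng.random((N, N))
q /= q.sum(axis=1, keepdims=True)
R = rng.random((T + 1, N))

paths, psi = enumerate_psi(q, R, x0, T)
Z = psi.sum()
p_opt = psi / Z                         # Eq. 8
C_opt = kl_control_cost(p_opt, psi)     # should equal -log Z, Eq. 9
```

Any other normalized distribution over trajectories gives a cost at least as large as `C_opt`, which is the usual non-negativity of the KL divergence.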

The optimal cost Eq. 9 is minus the log partition sum. The partition sum is the expectation value of the
exponentiated negative path cost $\exp(-\sum_{t=0}^{T} R(x^t,t))$ under the uncontrolled dynamics $q$. This is a surprising
result, because it means that we have a closed form solution for the optimal cost-to-go $C(x^0,p)$ in terms of the known
quantities $q$ and $R$ and one can thus estimate $C(x^0,p)$ by forward sampling under $q$. This result was previously obtained
in [2] for a class of continuous stochastic control problems. It will be discussed as a special case of the KL
control in section 4. The difference between the KL control computation and the standard computation using
the Bellman equation is schematically illustrated in Fig. 1.
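The forward-sampling estimate mentioned above can be sketched as follows. This is a plain Monte Carlo illustration, not the sampler used in [2]; the names `estimate_optimal_cost` and `exact_optimal_cost`, the time-independent `q[x, y]` and the layout `R[t, x]` are assumptions made for the example.

```python
import numpy as np

def estimate_optimal_cost(q, R, x0, T, n_samples, rng):
    """Monte Carlo estimate of -log Z(x0), Z = E_q[exp(-sum_{t=0}^T R(x^t, t))]."""
    N = q.shape[0]
    cum_q = np.cumsum(q, axis=1)
    x = np.full(n_samples, x0)
    cost = np.full(n_samples, R[0, x0], dtype=float)
    for t in range(1, T + 1):
        # draw x^t ~ q(. | x^{t-1}) for all trajectories by inverting the CDF
        r = rng.random(n_samples)
        x = (r[:, None] > cum_q[x]).sum(axis=1)
        x = np.minimum(x, N - 1)        # guard against round-off in cum_q
        cost += R[t, x]
    return -np.log(np.mean(np.exp(-cost)))

def exact_optimal_cost(q, R, x0, T):
    """Exact -log Z(x0) by summing out the chain backwards, for comparison."""
    beta = np.exp(-R[T])
    for t in range(T - 1, 0, -1):
        beta = np.exp(-R[t]) * (q @ beta)
    return R[0, x0] - np.log(q[x0] @ beta)
```

The estimator averages inside the logarithm, so it is biased for finite sample sizes, and its variance can become large when low-cost trajectories are rare under $q$.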

The optimal control at time $t = 0$ is given by the marginal probability

$$p(x^1|x^0) = \sum_{x^{2:T}} p(x^{1:T}|x^0) \qquad (10)$$

This is a standard graphical model inference problem, where the joint probability distribution $p$ is a Markov
chain with pair-wise potentials $\psi^t(x^t,x^{t+1}) = q^t(x^{t+1}|x^t) \exp(-R(x^{t+1},t+1))$, $t = 0, \ldots, T-1$. We can thus
compute $p(x^1|x^0)$ by backward message passing

$$\beta^T(x^T) = 1 \qquad (11)$$

$$\beta^t(x^t) = \sum_{x^{t+1}} \psi^t(x^t,x^{t+1}) \beta^{t+1}(x^{t+1}) \qquad (12)$$

$$p(x^1|x^0) \propto \psi^0(x^0,x^1) \beta^1(x^1) \qquad (13)$$

The equivalence between the Bellman equation Eq. 4 and the backward message passing Eqs. 11-13 for KL control
problems was first established in [9].
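The backward messages of Eqs. 11-13 translate directly into matrix-vector products. The following is a minimal sketch, assuming a time-independent free dynamics `q[x, y]` and a cost table `R[t, x]`; the function name `first_step_marginal` is chosen here for illustration.

```python
import numpy as np

def first_step_marginal(q, R, x0, T):
    """p(x^1 | x^0) from the backward messages of Eqs. 11-13.

    q[x, y] : free dynamics q(y | x), assumed time-independent in this sketch
    R[t, x] : state cost R(x, t) for t = 0..T
    """
    def psi(t):
        # pair-wise potential psi^t(x^t, x^{t+1}) = q(x^{t+1}|x^t) exp(-R(x^{t+1}, t+1))
        return q * np.exp(-R[t + 1])[None, :]

    beta = np.ones(q.shape[0])          # Eq. 11: beta^T = 1
    for t in range(T - 1, 0, -1):       # Eq. 12 for t = T-1, ..., 1
        beta = psi(t) @ beta
    p1 = psi(0)[x0] * beta              # Eq. 13, up to normalization
    return p1 / p1.sum()
```

The cost of this recursion is of order $T N^2$, compared with $N^T$ terms in the explicit sum of Eq. 10.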

The type of control solutions that one obtains depends on the choice of $q$ and $R$. For instance, if we assume
$q = 1$ (a constant free dynamics), Eq. 10 yields $p(x^1|x^0) \propto \exp(-R(x^1))$. In other words, the optimal control
solution depends on the immediate cost only. We will build more intuition about the type of control solutions that
one can obtain in the following sections.

We note that the optimal control $p$ is proportional to the free dynamics $q$. One can therefore obtain an
alternative formulation of the KL control problem by defining

$$p^t(y|x,u) = q^t(y|x) \exp(u^t_{xy}) \qquad (14)$$

with $u^t$ some unknown $N \times N$ matrix such that

$$\sum_y p^t(y|x,u) = 1. \qquad (15)$$

Substituting Eq. 14 in Eq. 5 yields a control cost in terms of a sequence of the matrices $u^t$, $t = 0, \ldots, T-1$

$$C(x^0,u^{0:T-1}) = \sum_{x^{1:T}} p(x^{1:T}|x^0) \left( \sum_{t=0}^{T-1} u^t_{x^t x^{t+1}} + \sum_{t=0}^{T} R(x^t,t) \right) \qquad (16)$$

and $C$ is minimized subject to the constraints Eq. 15. Note that in this formulation the control $u^t_{xy}$ itself is
the cost to move from state $x$ to $y$. This cost term is linear in $u^t_{xy}$, but the minimization with respect to $u^t_{xy}$ is
nevertheless well-defined because $C$ is bounded from below as a result of the normalization constraint Eq. 15.

3 Graphical model inference 

In typical control problems, $x$ is an $n$-dimensional vector with components $x = x_1, \ldots, x_n$. For instance, for a
multi-joint arm, $x_i$ may denote the state of each joint. For a multi-agent system, $x_i$ may denote the state of
each agent. In all such examples, $x_i$ itself may be a multi-dimensional state vector. In such cases, the optimal
control computation Eq. 10 is intractable.

However, the following assumptions are likely to hold:

• the uncontrolled dynamics factorizes over components

$$q^t(x^{t+1}|x^t) = \prod_{i=1}^{n} q_i^t(x_i^{t+1}|x_i^t)$$

• the interaction between components has a (sparse) graphical structure

$$R(x,t) = \sum_{\alpha} R_\alpha(x_\alpha,t)$$

with $\alpha$ a subset of the indices $1, \ldots, n$ and $x_\alpha$ the corresponding subset of variables.

Thus, $\psi$ in Eq. 7 has a graphical structure

$$\psi(x^{1:T}|x^0) = \prod_{t=0}^{T-1} \prod_{i=1}^{n} q_i^t(x_i^{t+1}|x_i^t) \prod_{t=0}^{T} \prod_{\alpha} \exp\left( -R_\alpha(x_\alpha^t,t) \right)$$

We can exploit the graphical structure when computing the marginal Eq. 10 using the junction tree method,
which may be more efficient than simply using the backwards messages.

Alternatively, we can use any of a large number of approximate inference methods to compute the optimal
control, such as the variational approximation [6], belief propagation [7], generalized belief propagation or CVM
[5, 12]. In addition, Monte Carlo sampling methods can be usefully applied. An example of this approach,
including an importance sampling scheme, was given in [13] for the continuous class of control problems where
the path integral Eq. 9 was estimated using a type of particle filtering.



4 Continuous space formulation 

In [5] it was shown that a general class of non-linear stochastic control problems in continuous space and time
can be solved as a path integral. The key step in that derivation was to transform the continuous Hamilton-
Jacobi-Bellman equation (which is a partial differential equation) to a linear diffusion-like equation, which is
formally solved as a path integral. In this section, we show how this result is obtained as a special case of the
KL control formulation.

Let $x$ denote an $n$-dimensional real vector with components $x_1, \ldots, x_n$ (components are denoted by
subscripts) and we define a discrete time stochastic dynamics

$$y = x + f(x,t) + u + \xi, \qquad (17)$$

with $f$ an arbitrary function, $\xi$ an $n$-dimensional Gaussian noise vector with covariance matrix $\nu$ and $u$ an
$n$-dimensional control vector. Since the noise is Gaussian, the conditional probability of $y$ given $x$ under control
$u$ is a Gaussian distribution, and thus

$$\begin{aligned}
p^t(y|x,u) &= \mathcal{N}(y \,|\, x + f(x,t) + u, \nu) \\
&= \mathcal{N}(y \,|\, x + f(x,t), \nu) \exp\left( (y - x - f(x,t))^T \nu^{-1} u - \frac{1}{2} u^T \nu^{-1} u \right) \\
&= q^t(y|x) \exp\left( u^t_{xy}(u) \right)
\end{aligned}$$

$$u^t_{xy}(u) = (y - x - f(x,t))^T \nu^{-1} u - \frac{1}{2} u^T \nu^{-1} u$$

where we have written $p^t(y|x,u)$ as in Eq. 14 and have defined the free dynamics $q$ as the dynamics that results
when $u = 0$ in Eq. 17. $u^t_{xy}(u)$ is a matrix from states $x$ to states $y$ (very large), parametrized by an $n$-dimensional
vector $u$.
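The factorization of the controlled Gaussian into free dynamics times an exponential factor is easy to verify numerically. The sketch below checks the identity in log form at one random point; the helper names `gauss_logpdf` and `log_control_factor` and the particular test values are assumptions made here.

```python
import numpy as np

def gauss_logpdf(y, mean, cov):
    """Log density of a multivariate normal N(y | mean, cov)."""
    d = y - mean
    n = y.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + n * np.log(2 * np.pi))

def log_control_factor(y, x, f, u, nu):
    """u_xy(u) = (y - x - f)^T nu^{-1} u - 0.5 u^T nu^{-1} u."""
    nu_inv_u = np.linalg.solve(nu, u)
    return (y - x - f) @ nu_inv_u - 0.5 * u @ nu_inv_u

rng = np.random.default_rng(0)
n = 3
x, y, f, u = (rng.normal(size=n) for _ in range(4))   # f stands for f(x, t)
A = rng.normal(size=(n, n))
nu = A @ A.T + np.eye(n)                              # a positive definite covariance

log_p = gauss_logpdf(y, x + f + u, nu)                # controlled dynamics
log_q = gauss_logpdf(y, x + f, nu)                    # free dynamics (u = 0)
gap = log_p - (log_q + log_control_factor(y, x, f, u, nu))
```

Up to floating point error `gap` is zero, because the two Gaussians share the same normalization and differ only in the terms that involve $u$.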

It is straightforward to compute the control cost Eq. 16 for this particular choice of $p$ and $q$. The result is

$$C(x^0,u^{0:T-1}) = \sum_{x^{1:T}} p(x^{1:T}|x^0) \left( \sum_{t=0}^{T-1} \frac{1}{2} (u^t)^T \nu^{-1} u^t + \sum_{t=0}^{T} R(x^t,t) \right) \qquad (18)$$

Eqs. 17 and 18 are similar to the path integral control problem discussed in [2]. Note that the cost of control
is quadratic in $u$, but of a particular form. One could in general write a quadratic form as $\frac{1}{2} u^T R u$, with $R$ an
arbitrary positive definite $n \times n$ matrix. However, the KL control class restricts the choice of $R$ to $R \propto \nu^{-1}$.
We note that time is discrete in the present formulation whereas the derivation in [2] made explicit use of the
continuous time formulation and the limit $dt \to 0$. Thus, the present derivation shows that we can also apply
the path integral approach to the discrete time version, with the path integral given by Eq. 9.



5 Numerical results 

Consider the example of piling blocks into a tower. This is a classic AI planning task [14], and it will be
instructive to see how this problem is solved as a control problem.

Let there be $n$ possible block locations on the one dimensional ring (line with periodic boundaries) as in
figure 2 and let $x_i^t \ge 0$, $i = 1, \ldots, n$, $t = 0, \ldots, T$ denote the height of stack $i$ at time $t$. Let $m$ be the total
number of blocks.

At iteration $t$, we allow one block to be moved from location $k^t$ to a neighboring location $k^t + l^t$
with $l^t = -1, 0, 1$ (periodic boundary conditions). Given $k^t$, $l^t = \pm 1$ and the old state $x^{t-1}$, the new state is given as

$$x^t_{k^t} = x^{t-1}_{k^t} - 1$$

$$x^t_{k^t + l^t} = x^{t-1}_{k^t + l^t} + 1$$

and all other stacks unaltered; for $l^t = 0$ no block is moved.

Figure 2: $m$ blocks at $n$ possible block locations with periodic boundary conditions. $x_i^t = 0, \ldots, m$ denotes the
height of stack at location $i$ at time $t$. The objective is to stack the initial block configuration (left) into a single
stack (right) through a sequence of single block moves to adjacent positions (middle).

We use the uncontrolled distribution $q$ to implement these allowed moves. For
the purpose of memory efficiency, we introduce auxiliary variables $s_i^t = -1, 0, 1$ that indicate whether the stack
height $x_i$ is decremented, unchanged or incremented, respectively. The uncontrolled dynamics $q$ becomes

$$q(k^t) = \mathcal{U}(1, \ldots, n)$$

$$q(l^t) = \mathcal{U}(-1, 0, +1)$$

$$q(s_i^t \,|\, k^t = i, l^t = \pm 1) = \delta_{s_i^t, -1}$$

$$q(s_i^t \,|\, k^t + l^t = i, l^t = \pm 1) = \delta_{s_i^t, +1}$$

$$q(s_i^t \,|\, k^t, l^t) = \delta_{s_i^t, 0} \quad \text{otherwise}$$

where $\mathcal{U}(\cdot)$ denotes the uniform distribution. With the selector bits $k^t$, $l^t$ taking values from the uniform
distribution, the transition from $x^{t-1}$ to $x^t$ is a mixture over the values of $k^t$, $l^t$:

$$q(x^t|x^{t-1}) = \sum_{k^t,l^t} q(k^t) q(l^t) \prod_{i=1}^{n} \sum_{s_i^t} q(x_i^t|x_i^{t-1},s_i^t) \, q(s_i^t|k^t,l^t)$$

$$q(x_i^t|x_i^{t-1},s_i^t) = \delta_{x_i^t, \, x_i^{t-1} + s_i^t}$$

Note that there are combinations of $x_i^{t-1}$ and $s_i^t$ that are forbidden: we cannot remove a block from a stack
of size zero ($x_i^{t-1} = 0$ and $s_i^t = -1$) and we cannot move a block to a stack of size $m$ ($x_i^{t-1} = m$ and $s_i^t = 1$).
If we restrict the values of $x_i^t$ and $x_i^{t-1}$ in the last line above to $0, \ldots, m$ these combinations are automatically
forbidden. The graphical model is given in figure 3.
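The uncontrolled block dynamics can be written out for a single time step as follows. This sketch enumerates the selector values directly instead of using the auxiliary variables $s_i^t$; locations are 0-based here, and the function names `apply_move` and `free_transition` are hypothetical. Forbidden moves simply receive no probability mass, so the resulting $q(x^t|x^{t-1})$ is not normalized in this sketch.

```python
def apply_move(x, k, l):
    """Move one block from location k to k + l on a ring; None if forbidden.

    x is a tuple of stack heights, k in 0..n-1 (0-based here), l in {-1, 0, +1}.
    """
    n, m = len(x), sum(x)
    if l == 0:
        return tuple(x)                 # all s_i = 0: nothing changes
    target = (k + l) % n                # periodic boundary conditions
    if x[k] == 0 or x[target] == m:     # empty source stack or full target stack
        return None
    new = list(x)
    new[k] -= 1                         # s_k = -1
    new[target] += 1                    # s_{k+l} = +1
    return tuple(new)

def free_transition(x):
    """Mixture q(x' | x) over uniform k and l, returned as a dict {x': mass}."""
    n = len(x)
    q = {}
    for k in range(n):
        for l in (-1, 0, 1):
            nxt = apply_move(x, k, l)
            if nxt is not None:
                q[nxt] = q.get(nxt, 0.0) + 1.0 / (3 * n)
    return q
```

For the initial configuration used later, `free_transition((4, 0, 4, 0))` assigns mass 1/3 to staying put (all four choices of $k$ with $l = 0$) and 1/12 to each of the four possible single-block moves.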

Finally, we define the state cost as the entropy of the distribution of blocks

$$R(x) = -\lambda \sum_i \frac{x_i}{m} \log \frac{x_i}{m}$$

with $\lambda$ a positive number to indicate the strength. Since $\sum_i x_i$ is constant (no blocks are lost), the minimum
entropy solution puts all blocks on one stack.
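As a small worked example of this state cost, the function below evaluates $R(x)$ with the usual convention $0 \log 0 = 0$. The name `state_cost` and the argument name `lam` (for the strength parameter) are chosen here for illustration.

```python
import math

def state_cost(x, lam):
    """R(x) = -lam * sum_i (x_i/m) log(x_i/m), with 0 log 0 taken as 0."""
    m = sum(x)
    return -lam * sum((xi / m) * math.log(xi / m) for xi in x if xi > 0)
```

With 8 blocks and strength 10, a single stack `(8, 0, 0, 0)` has cost 0, two equal stacks `(4, 0, 4, 0)` have cost $10 \log 2$, and the uniform configuration `(2, 2, 2, 2)` has the maximal cost $10 \log 4$.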

The control problem is to find the distribution $p$ that minimizes $C$ in Eq. 6. For large $\lambda$, the state costs
dominate and the optimal $p$ will be such that a single stack is constructed as fast as possible. For small $\lambda$, the
control costs dominate, and the optimal $p$ may be different as we will see in the next example.

In figure 4 (left), we show the optimal control solution for a small example where exact inference is feasible.
There are $n = 4$ block locations and $m = 8$ blocks. The initial block configuration $x^0$ consists of 4 blocks on both
locations 1 and 3. The horizon time is $T = 10$ and $\lambda = 10$. The solution was computed using the junction tree
method using [15]. Time runs horizontal ($t = 1, \ldots, 10$ in the top two subfigures and $t = 0, \ldots, 10$ in the bottom
two subfigures) and each column shows the posterior marginals $p(k^t)$, $k^t = 1, \ldots, 4$ (top), $p(l^t)$, $l^t = -1, 0, 1$
(second) and $p(x_i^t)$, $i = 1, \ldots, n$ (third) in grey scale (darker shows higher values). For example, the posterior
marginals for the first move $t = 1$ are $p(k^1) = 0.5$ for $k^1 = 1, 3$ and $p(l^1) = 0.5$ for $l^1 = \pm 1$, indicating that a
block should be removed from either stack $k^1 = 1$ or 3 and moved to stack 2 or 4 with equal probability. The
posterior marginals for the second move $t = 2$ are such as to take the block moved at $t = 1$ and put it onto the
other stack, etc.

Figure 3: Graphical model for the block stacking problem. Time runs horizontal and stack positions vertical.
At each time, the transition probability of $x^t$ to $x^{t+1}$ is a mixture over the variables $k^t$, $l^t$.

n    m    T    clique size    Memory JT (Mb)    Memory CVM (Mb)    Max error    Max error T=1
4    2    11   7              17                2.2                0.0304       0.0158
4    4    11   7              132               2.3                0.2348       0.1066
6    2    11   12             680               2.7                0.9596       0.0174
6    4    11   12             15.000            2.9                0.9732       0.1265

Table 1: Memory use and CVM errors for some small block stacking examples. Horizon time $T = 11$. Initial
block configuration is symmetric with $m/2$ blocks on two stacks maximally separated. CVM was used with 50
inner loops and an outer loop stop criterion of 0.00001 on the change of the CVM free energy.

Due to the symmetry of the initial configuration $x^0$ (both stacks 1 and 3 are equally high) the optimal
control solution has this same symmetry. Thus, the optimal solution is the mixed policy that assigns equal
probability to 4 moves. The deterministic policy to make any one of those moves is suboptimal.

In order to actually stack the blocks, the symmetry must be broken and a particular block move must be
proposed. The bottom row in figure 4 (left) shows the configurations $x^t$ that result if one breaks the initial symmetry
by moving a block from stack 3 to stack 2. Once the initial symmetry is broken, a unique block move sequence
is computed by taking the MAP estimate conditioned on the move at $t = 1$: $(k^t, l^t) = \mathrm{argmax}\, p(k^t, l^t | k^1, l^1)$
for $t = 2, \ldots, T$. Note that eight moves are required to stack all blocks on a single stack and that indeed the
marginals $p(l^t)$ are peaked on the value $l^t = 0$ for $t > 8$.

In figure 4 (right) we show the same results, but for $\lambda = 2$. In this case the cost of moving (given by the term
$KL(p\|q)$) has relatively more weight than in the previous example. Indeed, the posterior marginal $p(l^1)$ is
peaked on the value $l^1 = 0$, indicating that the optimal control solution is to make no move, and no blocks are
stacked at all. This is an example where the AI planning approach and the optimal control approach differ.
Whereas the former will always aim to find a strategy to stack the blocks, the optimal control solution may or
may not stack the blocks depending on the control cost relative to the (long term) state cost benefits. For the
same reason, if one runs the optimal control problem with a horizon time $T$ that is too short to move all blocks
into a single stack, the optimal control solution will be not to move the blocks.

Clearly, exact inference can only be used for small problem instances, since the computational requirements
for exact computation grow very fast with $n$ and $m$, as shown in Table 1. In particular, memory use of the
JT method scales very badly (this can admittedly be improved using an implementation of the JT method using
sparse tables).

We therefore use the Cluster Variation method to compute the marginal posterior probabilities of $p$
approximately. We use the minimal cluster size, i.e. the outer clusters are equal to the interaction potentials $\psi$. The
optimization of the CVM free energy is a non-convex optimization problem and we use the double loop method
that was proposed in [16]. See Appendix A for a brief description of the CVM approximation and the double
loop method.

Figure 4: $n = 4$, $m = 8$, $T = 10$ and $\lambda = 10$ (left) and $\lambda = 2$ (right) using exact inference. The blocks
are initialized in two stacks of height 4. Each subfigure shows the marginals $p(k^t)$ (top), $p(l^t)$ (second) and
$p(x_i^t)$, $i = 1, \ldots, n$ (third) and the MAP solution (bottom) for $t = 1, \ldots, T$ using a grey scale coding with white
coding for zero and darker colors coding for higher values.

For the same two problems as in figure 4 the CVM solution is given in figure 5. First of all, note that the
quality of the CVM solution deteriorates with $t$. For larger $t$ it becomes more and more difficult for the CVM
solution to correctly capture all the high order correlations that are present between the moves at different
times. The maximal CVM error in the first time slice and in all time slices is given for a number of instances in
table 1. Although the maximal errors are large, the errors in the first time slice are sufficiently small to correctly
propose a first move. These moves in figure 5 (left, right) coincide with the optimal moves in the first time slice
in figure 4 (left, right), respectively. One can thus obtain the optimal moves for all times by running CVM $T$
times in the following way. Run CVM with horizon $1:T$ and initial state $x^0$; make a move by choosing $k^1$, $l^1$
and the new state $x^1$; rerun CVM with horizon $2:T$ and initial state $x^1$; make a move by choosing $k^2$, $l^2$ and
the new state $x^2$; etc.
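The rerun-and-move procedure described above has the shape of a simple loop around an inference routine. The sketch below is only a skeleton under stated assumptions: `first_move_marginal` is a hypothetical callable standing in for a CVM (or exact) run on the remaining horizon, `apply_move` is a hypothetical state update, and picking the most probable move is one possible way of "choosing" that is assumed here.

```python
def receding_horizon(x0, T, first_move_marginal, apply_move):
    """Re-run inference on the remaining horizon and commit to one move at a time.

    first_move_marginal(x, horizon) -> dict {move: approximate posterior probability}
        for the first time slice (hypothetical stand-in for a CVM run).
    apply_move(x, move) -> next state (hypothetical).
    """
    x, moves = x0, []
    for t in range(1, T + 1):
        marginal = first_move_marginal(x, T - t + 1)   # horizon t:T
        move = max(marginal, key=marginal.get)         # ties broken by dict order
        x = apply_move(x, move)
        moves.append(move)
    return x, moves
```

Only the first-slice marginals are used in each run, which are the ones for which the CVM error was found to be small.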

We applied this approach to a large block stacking problem with $n = 8$, $m = 40$, $T = 80$ and $\lambda = 10$. The
results are shown in figure 6. The computation time was approximately 1 hour per $t$ iteration and memory use
was approximately 27 Mb. This instance was too large to obtain exact results. We conclude that although
the CPU time is large, the CVM method is capable of yielding an apparently good optimal control solution for this
large instance.

6 Discussion 

In this paper, we have shown the equivalence of a class of stochastic optimal control problems to a graphical 
model inference problem. As a result, efficient approximate inference methods can be applied to the intractable 
stochastic control computation. As we have seen, the class of KL control problems contains interesting special 
cases such as the class of continuous stochastic control problems discussed in section 4 and the type of AI
planning tasks discussed in section 5.

The class of KL control problems is restricted in the sense that there exist many stochastic control problems
that are outside this class. Examples are continuous control problems of the type discussed in section 4 where
the control cost is not of the form $\frac{1}{2} u^T R u$ but any other function of $u$, or where the control acts non-linearly in
the dynamics. Also, in the basic formulation of Eq. 1 one can construct control problems where the functional
form of the controlled dynamics $p^t(x^{t+1}|x^t,u^t)$ is given as well as the cost of control $R(x^t,u^t,x^{t+1},t)$. In general,
there may then not exist a $q^t(x^{t+1}|x^t)$ such that Eq. 5 holds. Therefore, such control problems are not in the
KL control class.

1 The moves in fig. 4 (left) and fig. 5 (left) are different, but equivalent due to the symmetry of the initial state.

Figure 5: Optimal control solution computed with the CVM method for the problem instances described in
figure 4.

Figure 6: A large block stacking instance, $n = 8$, $m = 40$, $T = 80$, $\lambda = 10$ using CVM.

We have used the Cluster Variation Method to approximate the computation of the marginal Eq. 10 for
the block stacking problem. We have used the double loop method as described in [16] to compute the cluster
marginals. We found that the marginal computation is quite difficult compared to other problems that we
have studied in the past (such as for instance in genetic linkage analysis [17]) in the sense that relatively many
inner loop iterations were required for convergence. We note that BP does not give any useful results on this
problem.

The quality of the CVM solution in terms of marginal probabilities was poor, but sufficiently accurate to 
successfully stack the blocks. One can improve the CVM accuracy if needed by considering larger clusters. 

The type of approximate inference method that should be used depends very much on the details of the
control problem. For instance, if one applies the continuous control formulation to robotics tasks, one may
wish to model obstacles as subsets of the state space where $R(x) = \infty$. For such non-differentiable problems a
variational (Gaussian) approximation will not work, but we have recently obtained good results using EP [18].

It is well-known that approximate inference methods are particularly successful for sparse, or close to tree- 
like, structures and/or for not too strong interactions. Thus the efficiency that we can obtain to solve optimal 
control is tightly linked to the sparsity structure of the problem. This insight was previously understood in the 
RL community and the equivalence of KL control to a graphical model inference problem makes this intuition 
particularly clear. 

It is worth mentioning the implications of KL control for the special case of coordination of agents. If one
considers the graphical structure that was proposed in section 3 with $i$ labeling the different agents, the result
of the control computation is the marginal distribution Eq. 10. Clearly, this distribution generally does not
factorize over agents:

$$p(x_1^1, \ldots, x_n^1 \,|\, x_1^0, \ldots, x_n^0) \neq \prod_{i=1}^{n} p(x_i^1 \,|\, x_1^0, \ldots, x_n^0) \qquad (19)$$

This is the well-known agent coordination problem: the choice of action of one agent affects the optimal choice 
of other agents. 

One can avoid this problem and obtain a control solution where agents can make their choices independent
of each other by restricting the class of controlled dynamics $p$ to those that factorize over the agents

$$p(x^{1:T}|x^0) = \prod_{i=1}^{n} p_i(x_i^{1:T}|x^0)$$

and one minimizes $C$ in Eq. 6 with respect to those $p$ only. The resulting optimal control solution is suboptimal
by construction, but 'solves' the coordination problem in the sense that the factorization of Eq. 19 is not violated and agents can
choose their actions independent of one another. Note that this factorized restriction on $p$ can also be viewed
as a variational approximation of the optimal $p$ [19].

Figure 7: Left) Example of a small network and a choice of clusters for CVM. Middle) Intersections of clusters
recursively define a set of sub-clusters. Right) $F_{\mathrm{cvm}}$ is non-convex (blue curve) and is bounded by a convex
function $F_{x_0}$.



A Cluster Variation Method 

In this appendix, we give a brief summary of the CVM method and the double loop approach. For a more
complete description see [12, 16].

The cluster variation method replaces the probability distribution $p(x)$ by a large number of (overlapping)
probability distributions, each describing the interaction between a small number of variables.

$$p(x) \approx \{ p_\alpha(x_\alpha), \alpha = 1, \ldots \}$$

with each $\alpha$ a subset of the indices $1, \ldots, n$, $x_\alpha$ the corresponding subset of variables and $p_\alpha$ the probability
distribution on $x_\alpha$. The set of clusters is denoted by $B$. One denotes the set of all intersections of pairs of
clusters in $B$, as well as intersections of intersections, by $M$. Fig. 7 (left) gives an example of a small directed
graphical model, where $B$ consists of 4 clusters and $M$ consists of 5 sub-clusters (Fig. 7, middle).

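Since $p$ is the minimum of a KL divergence with interactions $\psi$, the approximation to $p$ is obtained by
minimizing an approximation to the KL divergence

$$F(p) = \sum_x p(x) \log \frac{p(x)}{\psi(x)} \approx F_{\mathrm{cvm}}(\{p_\alpha\})$$

subject to normalization and consistency constraints:

$$\sum_{x_\alpha} p_\alpha(x_\alpha) = 1, \qquad p_\alpha(x_\beta) = p_\beta(x_\beta), \quad \beta \subset \alpha, \qquad p_\alpha(x_\alpha) \ge 0$$

The numbers $a_\beta$, which weight the cluster contributions in $F_{\mathrm{cvm}}$, are called the Möbius or overcounting
numbers. They can be recursively computed from the formula

$$1 = \sum_{\alpha \in B \cup M, \, \alpha \supseteq \beta} a_\alpha, \qquad \forall \beta \in B \cup M$$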
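The recursion for the overcounting numbers can be sketched as follows, processing the largest sets first so that all supersets of a sub-cluster are known when it is reached. The function names `sub_clusters` and `moebius_numbers` are chosen here, and the sketch assumes that no outer cluster is contained in another one.

```python
from itertools import combinations

def sub_clusters(outer):
    """The set M: intersections of clusters, closed under further intersection."""
    clusters = {frozenset(c) for c in outer}
    all_sets = set(clusters)
    while True:
        new = {a & b for a, b in combinations(all_sets, 2) if a & b} - all_sets
        if not new:
            return all_sets - clusters
        all_sets |= new

def moebius_numbers(outer):
    """Solve 1 = sum of a_alpha over alpha containing beta, for every beta in B and M."""
    B = {frozenset(c) for c in outer}
    regions = sorted(B | sub_clusters(outer), key=len, reverse=True)
    a = {}
    for beta in regions:
        # all strict supersets of beta are larger and therefore already assigned
        a[beta] = 1 - sum(a[alpha] for alpha in a if beta < alpha)
    return a
```

For the chain of clusters `[{1, 2}, {2, 3}, {3, 4}]` every outer cluster gets the number 1 and each of the intersections `{2}` and `{3}` gets -1, which corrects for the variable being counted twice.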

The minimization of $F_{\mathrm{cvm}}$ subject to the linear constraints is a non-convex optimization problem. A
guaranteed convergent approach to minimize $F_{\mathrm{cvm}}$ is to upper-bound it by a convex function $F_{x_0}$ that touches at
the current value $x_0$:

$$F_{\mathrm{cvm}}(x) \le F_{x_0}(x), \qquad F_{\mathrm{cvm}}(x_0) = F_{x_0}(x_0)$$

as is illustrated in fig. 7 (right). Optimizing $F_{x_0}(x)$ with respect to $x$ under linear constraints is a convex problem
that can be solved using the dual approach (inner loop). The solution $x^*(x_0)$ of this convex sub-problem is
guaranteed to decrease $F_{\mathrm{cvm}}$:

$$F_{\mathrm{cvm}}(x_0) = F_{x_0}(x_0) \ge F_{x_0}(x^*(x_0)) \ge F_{\mathrm{cvm}}(x^*(x_0))$$

Based on $x^*(x_0)$ a new convex upper bound is computed (outer loop). This is called a double loop method.
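The bound-and-minimize idea can be illustrated on a one-dimensional toy function. This is not the CVM free energy and there are no constraints: the function `f`, its split into a convex and a concave part, and the closed-form inner minimization are all assumptions made for this sketch, which only demonstrates the chain of inequalities above.

```python
import math

def f(x):
    """Toy non-convex objective: convex part x**4 plus concave part -2*x**2."""
    return x**4 - 2 * x**2

def upper_bound(x, x0):
    """Convex bound F_{x0}(x) >= f(x): the concave part is replaced by its tangent at x0."""
    return x**4 - 2 * x0**2 - 4 * x0 * (x - x0)

def minimize_bound(x0):
    """Inner problem: argmin_x F_{x0}(x); here 4 x**3 = 4 x0 has a closed form."""
    return math.copysign(abs(x0) ** (1.0 / 3.0), x0)

def double_loop(x0, n_outer=50):
    """Outer loop: rebuild the bound at the new point; f never increases."""
    values = [f(x0)]
    x = x0
    for _ in range(n_outer):
        x = minimize_bound(x)
        values.append(f(x))
    return x, values
```

Starting from `x0 = 0.1` the iterates move monotonically towards the local minimum at $x = 1$, where $f = -1$. In the actual double loop method the inner problem is a constrained convex optimization that itself requires an iterative solver.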